AD-A079 554

UNCLASSIFIED

MASSACHUSETTS UNIV AMHERST DEPT OF MATHEMATICS AND STATISTICS    F/G 12/1
SIMILARITY MEASURES ON BINARY ATTRIBUTE DATA - II. (U)
DEC 79    M. F. JANOWITZ    N00014-79-C-0629
TR-J7902


Technical Report Number J7902

SIMILARITY MEASURES ON BINARY
ATTRIBUTE DATA - II

M. F. Janowitz

Department of Mathematics & Statistics
UNIVERSITY OF MASSACHUSETTS
Amherst, Massachusetts

December 1979

REPORT DOCUMENTATION PAGE

Report number: J7902
Title: Similarity Measures on Binary Attribute Data - II
Type of report: Technical
Author: M. F. Janowitz
Performing organization: University of Massachusetts, Amherst, MA 01003
Contract or grant number: N00014-79-C-0629
Program element, project, task area and work unit number: 121405
Controlling office: Procuring Contracting Officer, Office of Naval Research, Arlington, VA 22217
Report date: December 1979
Number of pages: 19
Monitoring agency: Office of Naval Research Resident Representative, Harvard University, Gordon McKay Laboratory, Room 113, Cambridge, MA 02138
Security classification: Unclassified
Distribution statement: Approved for public release; distribution unlimited.

Key words: Numerical taxonomy, Cluster analysis, Similarity measure, Special clustering, Optimality measures, Cophenetic correlation

Abstract: The exact nature of the coefficient of special similarity is investigated, and its ability to recognize Gilmour natural classifications is compared to that of various other similarity measures by means of a computer simulation of the classification problem.

Similarity  Measures  On  Binary  Attribute  Data  -  II 


M. F. Janowitz*

This report has its genesis in the work by Farris
(1977), in which a claim is made that the phylogenetic
system of classification is superior to the phenetic.

I pointed out certain flaws in Farris' reasoning
(Janowitz, 1979), and Farris' reply appears in Farris (1979).
I enlarged upon my views in Janowitz (1979a), and the
current report should be regarded as a continuation of
that work. The terminology and notation will follow
that of Janowitz (1979a), and much of the data will be
taken from there. Despite this, it will prove useful
to briefly redefine the similarity measures that will
be considered, and to restate some of the issues that are
under contention.

The input data is a set of p objects to be classified,
together with a set of n attributes. For purposes of the
present work, I shall assume that each attribute has two
states, and that they are coded 0 and 1. For objects J
and K, one can now define the following symbols:

    Symbol    No. of attributes for which
    a         J and K have state 1
    b         J has state 1 and K has state 0
    c         J has state 0 and K has state 1
    d         J and K have state 0.

*Research supported by ONR Contract N00014-79-C-0629 as
well as by grants from the University of Massachusetts
Computer Center.


Here then are the similarity coefficients that will be
considered:

    Measure   Name               Similarity formula      Dissimilarity formula
    DC1       Simple Matching    (a + d)/n               (b + c)/n
    DC2       Jaccard            a/(a + b + c)           (b + c)/(a + b + c)
    DC4       Russell and Rao    a/n                     1 - a/n
    DC10      Yule               (ad - bc)/(ad + bc)     bc/(ad + bc)

For binary attribute data, DC1 represents Farris' coefficient
of overall similarity. Letting s denote the similarity
version of DC1, let us see now how Farris defines his
coefficient of special similarity (1977:826,836). The idea
is to decide that for each attribute, one of the two states
is uninformative. An object R, called the reference point,
is then defined. R might be one of the objects already
under consideration, or it might be a new object. In either
case, R has the property that for each attribute, it has
the uninformative state. To compute the special similarity
coefficient a, one then uses the formula (Farris, 1977:836)

    $$a(J,K) = \tfrac{1}{2}\{1 + s(J,K) - s(J,R) - s(K,R)\}.$$

This of course produces the similarity version of a; to
view it as a measure of dissimilarity, one simply considers
1 - a(J,K).
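The formula for special similarity can be tried out on a small case. The following is a minimal sketch, assuming that s is the simple matching similarity and that objects are given as lists of 0/1 states; the function names and the four-attribute data are hypothetical and do not come from the report.

```python
from typing import Sequence


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of attributes on which x and y have the same state."""
    return sum(1 for p, q in zip(x, y) if p == q) / len(x)


def special_similarity(j: Sequence[int], k: Sequence[int], r: Sequence[int]) -> float:
    """Special similarity of J and K relative to the reference point R."""
    s = simple_matching
    return 0.5 * (1 + s(j, k) - s(j, r) - s(k, r))


def special_dissimilarity(j: Sequence[int], k: Sequence[int], r: Sequence[int]) -> float:
    """Dissimilarity version: 1 minus the special similarity."""
    return 1 - special_similarity(j, k, r)


# Hypothetical data: the reference point carries state 0 on every attribute.
R = [0, 0, 0, 0]
J = [1, 1, 0, 0]
K = [1, 0, 1, 0]
value = special_similarity(J, K, R)
```

With this data s(J,K), s(J,R) and s(K,R) all equal 1/2, so the special similarity is 1/4: J and K share exactly one of the four attributes in the state that differs from R.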


Here is a portion of the argument used by Farris in
Farris (1977 and 1979), stated in what I hope is an
accurate manner. Pheneticists favor using the coefficient
of overall similarity, and have also favored using the
cophenetic correlation coefficient as a measure of how
well a cluster technique really works. Farris presented
a set of fully congruent attributes that represented a
Gilmour natural classification. He then showed that
this classification was recaptured by special similarity
but not by overall similarity. He concluded that overall
similarity could give a wrong classification. He
then proceeded to establish that it generally gives
the wrong classification by applying the two coefficients
to a large number of real life data sets. In each case,
the cophenetic correlation coefficient said that special
similarity was far superior to overall similarity. He
concluded that since the optimality measure favored by
pheneticists produced the result that a phylogenetic
method was superior to the phenetic method favored by
these same people, it must follow that the phylogenetic
method is indeed superior.

I argued in Janowitz (1979 and 1979a) that special
similarity does not do a very good job of recognizing
natural classifications for attribute data that is not
fully congruent, and that the cophenetic correlation
coefficient cannot be used as a measure of the ability
of a similarity measure to recapture such classifications.
Indeed, it was argued (1979a) that if the reference point
consists of all 0 states, then special similarity does
not perform as well as either simple matching or Yule's
coefficient. But Farris did not choose his reference
point in this manner. Rather, he took as a reference point
the first object he came to (Farris (1977:836) and
(1979:210)). This was something that I did not do in 1979a,
and something that will be explored in the present work.

In section 1, I shall take a careful look at the nature of
special similarity, while in sections 2 and 3, I shall
compare its performance with various other similarity
measures, taking the reference point to be the first
object of the set of objects to be classified. The comparison
will be made both for fully congruent attribute data,
and for the more general situation where various types of
errors are introduced into the attribute data - the idea
being to calculate the ability of various methods to
recapture Gilmour natural classifications when they can
be recognized.


§ 1. The nature of special similarity. Letting R be
the reference point, let's have a close look at a(J,K),
where J, K are objects to be classified. Suppose that
there are n attributes, and that $h_1, h_2, \ldots, h_8$ are
non-negative integers whose sum is n. Suppose further that
the following table describes the attribute data for
R, J, K:

    No. attributes   h_1  h_2  h_3  h_4  h_5  h_6  h_7  h_8
    R                 0    0    0    0    1    1    1    1
    J                 0    1    0    1    0    1    0    1
    K                 0    0    1    1    0    0    1    1

Application of the formula for a shows that

    $$a(J,K) = (h_4 + h_5)/n.$$

But the same effect may be obtained by simply recoding the
attributes so that R has all 0 states, and then applying
DC4. This is shown by the recoded data table

    No. attributes   h_1  h_2  h_3  h_4  h_5  h_6  h_7  h_8
    R                 0    0    0    0    0    0    0    0
    J                 0    1    0    1    1    0    1    0
    K                 0    0    1    1    1    1    0    0
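The equivalence between special similarity and DC4 on recoded data can be checked mechanically. The sketch below is an illustration rather than anything taken from the report: it recodes each object by exchanging states on the attributes where R has state 1, and compares the two quantities with exact rational arithmetic over every triple of short 0/1 vectors. All function names are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence, Tuple


def simple_matching(x: Sequence[int], y: Sequence[int]) -> Fraction:
    return Fraction(sum(1 for p, q in zip(x, y) if p == q), len(x))


def special_similarity(j, k, r) -> Fraction:
    s = simple_matching
    return Fraction(1, 2) * (1 + s(j, k) - s(j, r) - s(k, r))


def recode(x: Sequence[int], r: Sequence[int]) -> Tuple[int, ...]:
    """Exchange states 0 and 1 on every attribute where R has state 1."""
    return tuple(p ^ q for p, q in zip(x, r))


def russell_rao(x: Sequence[int], y: Sequence[int]) -> Fraction:
    """DC4 similarity a/n."""
    return Fraction(sum(1 for p, q in zip(x, y) if p == 1 and q == 1), len(x))


def equivalent_for_all(n: int) -> bool:
    """Check the identity for every choice of R, J, K on n attributes."""
    vectors = list(product((0, 1), repeat=n))
    return all(
        special_similarity(j, k, r) == russell_rao(recode(j, r), recode(k, r))
        for r in vectors for j in vectors for k in vectors
    )


# The table above with each h_i equal to 1 (so n = 8).
R = (0, 0, 0, 0, 1, 1, 1, 1)
J = (0, 1, 0, 1, 0, 1, 0, 1)
K = (0, 0, 1, 1, 0, 0, 1, 1)
check = special_similarity(J, K, R)
```

With every h_i equal to 1 the special similarity is 2/8, in agreement with (h_4 + h_5)/n. The reason is visible attribute by attribute: the quantity 1 + [J=K] - [J=R] - [K=R] equals 2 when both J and K differ from R and 0 otherwise.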


From  this  viewpoint  one  can  instantly  see  a  major  difficulty 
in  the  interpretation  of  the  output  of  special  similarity. 




If this is viewed as presence-absence data, these
attributes then reflect the natural classification
whose clusters at each level are:

    Level 1    12, 3, 4
    Level 2    123, 4
    Level 3    1234

Taking object 1 as the reference point, special similarity
acts just like DC4 on the data matrix

    object    attributes
    1         0000000
    2         0001100
    3         11001010
    4         11101001

This produces the classification

    Level 1    1, 2, 34
    Level 2    1, 234
    Level 3    1234

A common reaction to all of this seems to be that if one
views the output of special similarity as an unrooted
tree, then the desired natural classification may still
be recaptured. The problem is that without prior knowledge
of the desired classification, it is difficult to
see how to reroot such a tree so as to simultaneously
remove the unwanted cluster 34, and produce the desired
cluster 12. Similar examples will be considered in the
next section.




Special similarity was applied to the data sets that
appeared as Tables 6, 7, 8, 9 of Janowitz (1979a). The
data in the tables attempt to reflect the following
natural classifications (only the nontrivial clusters
will be listed at each level):

    Table 6                              Table 7
    Level 1   12, 34, 56, 78, 9-10       Level 1   12, 9-10
    Level 2   1-4, 56, 78, 9-10          Level 2   1-3, 8-10
    Level 3   1-6, 78, 9-10              Level 3   1-4, 7-10
    Level 4   1-8, 9-10                  Level 4   1-5, 6-10
    Level 5   1-10                       Level 5   1-10

    Table 8                              Table 9
    Level 1   12, 45, 9-10               Level 1   12, 34, 56, 78
    Level 2   1-3, 4-6, 8-10             Level 2   1-4, 56, 7-10
    Level 3   1-6, 7-10                  Level 3   1-6, 7-10
    Level 4   1-10                       Level 4   1-10

When special similarity is applied, the result in each
case is an ultrametric. Application of single linkage
clustering now produces the classifications shown in
Fig. 1. The reader should check for himself to see what
a poor job has been done in recapturing the desired
natural classifications. Furthermore, when the
classifications of Fig. 1 are viewed as unrooted trees,
it is difficult to see how to reroot the trees so as to
recapture the desired classifications unless one has
prior knowledge of the underlying natural classifications.
This is especially well illustrated by Fig. 1(b). How
would one know that this tree should be rerooted so as
to produce the classification 1-6, 7-10, for example?
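For readers who wish to experiment, here is a small sketch of the two notions used in the paragraph above: a test for the ultrametric inequality, and single linkage clusters read off as the connected components of the graph joining objects whose dissimilarity does not exceed a given level. The dissimilarity matrix is hypothetical (it is not data from the report), and the function names are assumptions of the sketch.

```python
from typing import List, Sequence


def is_ultrametric(d: Sequence[Sequence[float]]) -> bool:
    """True if d(i,k) <= max(d(i,j), d(j,k)) for all objects i, j, k."""
    n = len(d)
    return all(
        d[i][k] <= max(d[i][j], d[j][k])
        for i in range(n) for j in range(n) for k in range(n)
    )


def clusters_at(d: Sequence[Sequence[float]], h: float) -> List[List[int]]:
    """Single linkage clusters at level h, with objects numbered from 1."""
    n = len(d)
    label = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= h and label[i] != label[j]:
                # Merge the component containing j into the one containing i.
                old, new = label[j], label[i]
                label = [new if x == old else x for x in label]
    groups = {}
    for index, lab in enumerate(label):
        groups.setdefault(lab, []).append(index + 1)
    return sorted(groups.values())


# Hypothetical dissimilarities on four objects.
D = [
    [0, 3, 3, 3],
    [3, 0, 2, 2],
    [3, 2, 0, 1],
    [3, 2, 1, 0],
]
levels = [clusters_at(D, h) for h in (1, 2, 3)]
```

On this matrix the successive levels are 1, 2, 34, then 1, 234, then 1234, which has the same shape as the classification obtained in Section 1 when object 1 is the reference point.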

There really seems to be no way out of this dilemma. If
one chooses the first object as a reference point, then
special similarity will not recapture a natural
classification, even for fully congruent attribute data; on
the other hand, if one takes as a reference point an
object having all 0 states, then special similarity works
fine in the fully congruent case, but as was shown in
Janowitz (1979a), it does not perform well in the
incongruent case. It remains to be seen how special similarity
will perform in the incongruent case when the reference
point is taken as the first object to be classified.

This will be done in the next section.


§ 3. Special similarity on incongruent input data.

As a test of this, I repeated a simulation that was done
in Janowitz (1979a). I took each of the data matrices
from Tables 6, 7, 8, 9 and 1 of that paper, replicated the
characters as indicated therein, doubled each character,
introduced a [...] random error in reading character states,
introduced 6 random characters, and finally discarded
10% of the resulting characters in a random fashion. I
used special similarity as well as DC1, DC2, DC4 and DC10,
followed by both single linkage and u.5-clustering.
Methods 1 and 4 of 1979a were then used to measure the
ability of each cluster method to recapture the
classification produced by the unperturbed data. In all cases
except that of special similarity, this measures the
ability of the cluster method to recapture the desired
natural classification. In the case of special similarity,
it measures how well the faulty classifications of Fig. 1
are recaptured. Here then are the results, based upon 5
trials with each data set, and using single linkage
clustering.

                 Method 1          Method 4          Coph. corr.
                 Mean     SD       Mean     SD       Mean     SD
    Table 6
      DC1        .8968    .0429    .0816    .0753    .8183    .1112
      DC2        .8671    .0703    .1143    .15?5    .8290    .1090
      DC4        .7107    .2617    .2760    .2293    .7448    .1873
      DC10       .9137    .1202    .0875    .1166    .7586    .1909
      Spec       .8612    .0738    .1265    .1140    .9367    .0192
    Table 7
      DC1        .0377    .0542    .2279    .0727    .8488    .0358
      DC2        .9199    .0387    .2099    .1602    .8750    .0474
      DC4        .8503    .0436    .2212    .1198    .7918    .0732
      DC10       .8368    .0482    .1469    .1344    .7707    .0305
      Spec       .0867    .0520    .0709    .0470    .9678    .0131
    Table 8
      DC1        .8482    .1285    .1893    .1948    .8013    .1209
      DC2        .8879    .0580    .0923    .0405    .8610    .0575
      DC4        .6384    .1147    .4075    .0860    .7281    .1281
      DC10       .8938    .0980    .0786    .0242    .7794    .0642
      Spec       .8870    .0591    .1137    .0693    .9536    .0238
    Table 9
      DC1        .7769    .1598    .2596    .1309    .7858    .0789
      DC2        .7891    .1172    .2014    .1779    .8458    .0395
      DC4        .6012    .1416    .3990    .1567    .7482    .0649
      DC10       .8073    .1819    .1705    .1901    .7330    .0835
      Spec       .0872    .0448    .1426    .0948    .9457    .0217
    Table 1
      DC1        .8099    .1135    .2256    .2012    .7721    .0790
      DC2        .7586    .0736    .2746    .1931    .7967    .0679
      DC4        .5676    .0358    .4360    .0453    .7146    .0644
      DC10       .7787    .1744    .1799    .1460    .7172    .0716
      Spec       .8216    .0?56    .085?    .0633    .9585    .0151
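The perturbation steps of the simulation can be sketched as follows. This is only an illustration under stated assumptions: the error rate is not legible in the source, so `error_rate` is a free parameter here; a "random error in reading character states" is interpreted as independently flipping each state with that probability; and the replication of characters "as indicated" in Janowitz (1979a) is not reproduced because the scheme is not given in this report. The function name `perturb` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def perturb(
    matrix: List[List[int]],
    error_rate: float,               # assumption: value illegible in the source
    n_random: int = 6,               # number of random characters appended
    discard_fraction: float = 0.10,  # share of characters discarded at random
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Return a perturbed copy of a 0/1 data matrix (rows are objects)."""
    rng = rng or random.Random()
    # Double each character.
    doubled = [[s for s in row for _ in range(2)] for row in matrix]
    # Assumed error model: flip each state independently.
    noisy = [[1 - s if rng.random() < error_rate else s for s in row]
             for row in doubled]
    # Append characters whose states are chosen at random for every object.
    for _ in range(n_random):
        for row in noisy:
            row.append(rng.randint(0, 1))
    # Discard a random subset of the resulting characters (columns).
    n_chars = len(noisy[0])
    n_drop = round(discard_fraction * n_chars)
    keep = sorted(rng.sample(range(n_chars), n_chars - n_drop))
    return [[row[c] for c in keep] for row in noisy]


# Hypothetical 3-object, 7-character matrix, used only to exercise the code.
example = [
    [0, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
]
perturbed = perturb(example, error_rate=0.05, rng=random.Random(0))
```

For this example the 7 characters become 14 after doubling and 20 after the random characters are appended, and 2 of those are then discarded, leaving 18.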

The data in the above table pretty well confirm results
that were announced in 1979a. With respect to the
cophenetic correlation coefficient, special similarity appears
to be far superior to any of the others. (Note: This is
the result that Farris announced in 1977 and 1979.) But
it also does well with respect to the criteria of Methods 1
and 4. This shows that it does a good job of recapturing its
faulty view of the underlying classifications, and consequently
must do a poor job of actually recapturing those
classifications. Ignoring the performance of special similarity,
DC10 is best in 3 out of 5 cases with respect to the
criterion of Method 1, and is best in 4 out of 5 cases
with respect to Method 4. Though it seems pointless to
reproduce the results of these 25 trials, it still is
informative to reproduce a portion of the actual clusters
produced by special similarity. The interested reader
may obtain the remainder from the author upon request.

Here then are the results from the data of Table 6.

    Level   Trial 1        Trial 2     Trial 3    Trial 4        Trial 5
    1       9-10           9-10        9-10       78             9-10
    2       78, 9-10       78, 9-10    78, 9-10   56, 78         9-10, 78
    3       56, 78, 9-10   7-10        7-10       56, 78, 9-10   7-10
    4       56, 7-10       34, 7-10    5-10       5-8, 9-10      5-10
    5       5-10           34, 6-10    4-10       23, 5-10       3-10
    6       3-10           3-10        2-10       2-10           2-10
    7       2-10           2-10        1-10       1-10           1-10
    8       1-10           1-10

The above table indicates the nontrivial clusters only.
These results should be compared with the desired
natural classification listed earlier for Table 6.

The same simulation was then performed using u.5-clustering
in place of single linkage clustering, and
here are the results.


                 Method 1          Method 4          Coph. corr.
                 Mean     SD       Mean     SD       Mean     SD
    Table 6
      DC1        .9250    .0222    .1523    .0717    .8713    .0328
      DC2        .9114    .0184    .1648    .1134    .8958    .0285
      DC4        .8746    .0433    .1727    .1218    .8317    .0539
      DC10       .9058    .0839    .1156    .0899    .8363    .0586
      Spec       [legible entries: .1903, .0952, .9538, .0316, .0544]
    Table 7
      DC1        [legible entries: .9153, .0919, .3067, .8474]
      DC2        .9220    .0646    .1156    .1604    .8955    .0352
      DC4        .8979    .0906    .1588    .1638    .8480    .0375
      DC10       .9702    .0160    .0455    .0172    .8300    .0487
      Spec       .9424    .0136    .0541    .0356    .9744    .0081
    Table 8
      DC1        .7584    .1995    .1887    .1649    .7820    .1241
      DC2        .7332    .2699    .2790    .2264    .8031    .1252
      DC4        .3979    .1498    .4747    .0949    .6080    .1928
      DC10       .7829    .2439    .2008    .2073    .7213    .1181
      Spec       .9408    .0227    .0786    .0422    .9434    .1145
    Table 9
      DC1        .9119    .0591    .1020    .0814    .8496    .0515
      DC2        .8619    .0891    .1283    .1279    .8833    .0323
      DC4        .8366    .1196    .1586    .1280    .8131    .0351
      DC10       [legible entries: .8987, .0855, .1378, .1471, .1231, .8268, .0434]
      Spec       [legible entries: .9010, .0661, .1879, .9582, .0124]
    Table 1
      DC1        .8432    .1570    .1481    .1468    .7886    .0744
      DC2        [legible entries: .1729, .1540, .1506, .8433, .0575]
      DC4        .6223    .2914    .3233    .2407    .7027    .1292
      DC10       .7947    .2287    .2132    .1796    .7496    .1080
      Spec       [legible entries: .1074, .1122, .972?, .0094]


The situation is similar to that of the earlier case.
Special similarity has by far the highest cophenetic
correlation, while DC4 is significantly the lowest.

Since special similarity also performs pretty well with
respect to the criteria of Methods 1 and 4, what this shows
is that it does a very good job of reflecting a
classification that is not the one that we are after. If one
ignores special similarity, then DC1 is best in 3 out of
the 5 cases with respect to the criterion of Method 1,
and is best in 4 out of 5 cases with respect to Method 4.
In all of these instances, DC4 is the worst. Thus whether
one regards the first object as the reference point or one
takes the reference point to have all 0 states, special
similarity is the worst choice of these dissimilarity
measures with respect to its ability to recapture a
natural classification in the presence of errors.

Having established that special similarity is not a
particularly good choice as a similarity measure, at
least for this particular simulation, I would like to
close this section by presenting 5 more trials on each
data set. This time, only DC1, DC2 and DC10 are involved.




Here it should be noted that Method 1 shows DC10 to be
best on 4 out of the 5 data sets. Method 4 shows DC10
to be best on 3 of the 5 sets, while the cophenetic
correlation coefficient produces quite different results
in that it makes DC10 the worst choice in 4 out of the
5 sets. Again we have proof that the cophenetic
correlation coefficient does not provide a meaningful
criterion for measuring the performance of a similarity
measure.
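For concreteness, the cophenetic correlation coefficient is the ordinary product-moment correlation between the input dissimilarities and the ultrametric values read from the resulting tree. The sketch below is an illustration, not the author's program: it obtains the single linkage ultrametric as the minimax path distance (the largest step on the best chain joining two objects) and then correlates the two sets of values. The function names and the three-object matrix are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def single_linkage_ultrametric(d: Sequence[Sequence[float]]) -> List[List[float]]:
    """Ultrametric produced by single linkage: minimax path distances."""
    n = len(d)
    u = [list(row) for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = max(u[i][k], u[k][j])  # largest step when passing through k
                if via < u[i][j]:
                    u[i][j] = via
    return u


def cophenetic_correlation(d: Sequence[Sequence[float]],
                           u: Sequence[Sequence[float]]) -> float:
    """Pearson correlation between d and u over all pairs of distinct objects."""
    pairs = list(combinations(range(len(d)), 2))
    x = [d[i][j] for i, j in pairs]
    y = [u[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / sqrt(sxx * syy)


# Hypothetical dissimilarities that violate the ultrametric inequality.
D = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
U = single_linkage_ultrametric(D)
r = cophenetic_correlation(D, U)
```

Here the value 4 between the first and third objects is pulled down to 2, and the correlation works out to 2/sqrt(7), roughly 0.756. A high value only says that the tree is faithful to the dissimilarities that were fed in, which is why it cannot by itself certify that those dissimilarities reflect the natural classification.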


§ 4. Conclusion. If one decides that it is important
for cluster methods to be able to recognize natural
classifications in the presence of certain types of error,
then the results that were presented in Janowitz (1979a)
and the present paper seem to indicate that special
similarity is a very poor choice for a similarity measure,
while both simple matching and Yule's coefficient seem
to be quite good. One must exercise extreme caution,
however, in attempting to draw conclusions such as this from
specific data sets. The results could very well be data
dependent! The only safe conclusion that can be drawn
(hence the only conclusion that I shall draw) is that
the results cast considerable doubt on any assertion
that special similarity is superior to the simple matching
coefficient. The matter should still be regarded as open.
One can of course use these results as an indicator
that simple matching and Yule's coefficient might be
reasonable candidates for use as a measure of similarity,
but at the moment that is all I dare say.

What then is it that Farris has shown? He took a
phylogenetic cluster method, applied a measure of optimality
that has in the past been favored by pheneticists, and
showed that in a large number of trials with different data
sets, his phylogenetic method was deemed superior to the
method favored by the pheneticists. I have indicated in
great detail why it is that one cannot conclude that this
shows the phylogenetic method to be superior to any
phenetic method. What it does show is that the optimality
measure fails. This is the heart of what Farris has
demonstrated. I have little quarrel with his data -
only with the conclusion that he has drawn from that data.

In a later paper, I shall examine the question of just
why it is that Yule's coefficient performs well on the
type of data that I have been examining. I shall also make
some concrete suggestions as to which similarity measures
ought generally to be used.




REFERENCES

Farris, J. S. 1977. On the phenetic approach to vertebrate
classification. In Hecht, M. K., P. C. Goody and B. M.
Hecht (eds.), Major patterns in vertebrate evolution.
Plenum, New York, pp. 823-850.

Farris, J. S. 1979. On the naturalness of phylogenetic
classifications. Syst. Zool. 28:200-214.

Janowitz, M. F. 1979. A note on phenetic and phylogenetic
classifications. Syst. Zool. 28:197-199.

Janowitz, M. F. 1979a. Similarity measures on binary
attribute data. University of Massachusetts Technical
Report No. J7901.


Department  of  Mathematics  and  Statistics 
University  of  Massachusetts 
Amherst,  MA  01003 
USA 


